##
## The downloaded binary packages are in
## /var/folders/cf/q6t0pc9n61j8hg_ycqdn0bvr0000gn/T//RtmpHr5Zw0/downloaded_packages
## Libraries Used
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ROCR)
library(ggplot2)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(readr)
library(xgboost)
##
## Attaching package: 'xgboost'
##
## The following object is masked from 'package:dplyr':
##
## slice
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
session <- list()
for (i in 1:18) {
  file_path <- paste('/Users/akshajjoshi/AJROOT/UCDAVIS/Code/STA141AProject/Data/sessions/session', i, '.rds', sep='')
  session[[i]] <- readRDS(file_path)
}
In this study, we look closely at how the brain's activity is linked to behavior, such as actions, choices, and how engaged an animal is. This research provides important new information by mapping out where in the brain these activities happen. Using advanced Neuropixels probes, the underlying study recorded signals from about 30,000 neurons across 42 different areas of the mouse brain. This work helps us understand how brain functions like seeing, making choices, moving, and paying attention arise from different parts of the brain.
This research builds on past studies that mostly looked at single brain areas at a time, missing how different parts of the brain work together during decision-making. Using these new tools, the researchers examined how different brain areas behave during tasks where mice had to distinguish between visual signals. They found that while some brain activity is common across many areas when a mouse starts to move, the tasks of seeing something and deciding on an action are handled by specific areas. This challenges older ideas and shows how complex and varied brain activity is. These discoveries open new paths for understanding how the brain works, especially how behaviors are controlled and decisions are made.
In this study, we delve into the intricate relationship between neural activity and behavioral responses using data from experiments conducted on mice. The primary focus is on the analysis of neural spike trains and their correlation with the outcomes of visual stimuli-based decision-making tasks. Utilizing a subset of data from the comprehensive study by Steinmetz et al. (2019), we concentrate on spike trains recorded from the visual cortex of four mice across eighteen experimental sessions. This research integrates advanced computational techniques, including machine learning algorithms such as XGBoost, logistic regression, and random forests, to develop predictive models that assess the likelihood of success or failure in the trials based on neural activity patterns and stimulus contrasts.
Our exploratory data analysis sheds light on the structural and functional nuances of the dataset, including the distribution of neural spikes, variability across trials, and differences among individual mice. The study aims not only to predict trial outcomes effectively but also to explore the homogeneity and heterogeneity of neural responses across sessions and subjects. This approach enhances our understanding of the neural underpinnings of decision-making and contributes to the broader field of computational neuroscience by applying data integration techniques and predictive modeling to elucidate complex brain-behavior relationships.

***
Here is a glimpse into the data we will be working with for this project. Note all the variables and categories we are working with.
head(full_tibble)
## # A tibble: 6 × 13
## brain_area region_sum_spike region_count region_mean_spike trail_id
## <chr> <dbl> <int> <dbl> <int>
## 1 ACA 138 109 1.27 1
## 2 CA3 104 68 1.53 1
## 3 DG 79 34 2.32 1
## 4 LS 269 139 1.94 1
## 5 MOs 106 113 0.938 1
## 6 SUB 156 75 2.08 1
## # ℹ 8 more variables: contrast_left <dbl>, contrast_right <dbl>,
## # feedback_type <dbl>, mouse_name <chr>, date_exp <chr>, session_id <int>,
## # success <dbl>, contrast_diff <dbl>
(session_summary)
## # A tibble: 18 × 7
## session_id mouse_name num_trials mean_contrast_diff mean_success_rate
## <int> <chr> <int> <dbl> <dbl>
## 1 1 Cori 114 0.471 0.605
## 2 2 Cori 251 0.458 0.633
## 3 3 Cori 228 0.398 0.662
## 4 4 Forssmann 249 0.355 0.667
## 5 5 Forssmann 254 0.372 0.661
## 6 6 Forssmann 290 0.363 0.741
## 7 7 Forssmann 252 0.321 0.671
## 8 8 Hench 250 0.545 0.644
## 9 9 Hench 372 0.489 0.685
## 10 10 Hench 447 0.448 0.620
## 11 11 Hench 342 0.433 0.795
## 12 12 Lederberg 340 0.393 0.738
## 13 13 Lederberg 300 0.419 0.797
## 14 14 Lederberg 268 0.396 0.694
## 15 15 Lederberg 404 0.430 0.765
## 16 16 Lederberg 280 0.430 0.718
## 17 17 Lederberg 224 0.464 0.830
## 18 18 Lederberg 216 0.417 0.806
## # ℹ 2 more variables: unique_area <int>, mean_session_spike <dbl>
This data frame summarizes the data from full_tibble by session so we can get deeper insights into the data.
Here are some quick summaries of the count of each feedback type (coded as 1 and -1), the mean left and right contrasts, and the total number of neurons present.
cat("There are", total_neurons, "neurons.")
## There are 62 neurons.
(feedback_summary)
## # A tibble: 2 × 2
## feedback_type count
## <dbl> <int>
## 1 -1 14533
## 2 1 34640
(stimuli_summary)
## # A tibble: 1 × 2
## mean_contrast_left mean_contrast_right
## <dbl> <dbl>
## 1 0.347 0.321
Here we explore the relationship between mean success rate and contrast difference, and we can see a fairly positive relationship between the two. The relationship is not strictly linear as the fitted line suggests, but this smoothing worked best to show the positive trend.
(success_vs_contrast)
## `geom_smooth()` using formula = 'y ~ x'
Through these plots, we look at the neural activity data in a variety of ways. A few observations: neural activity generally goes down or remains about the same as sessions increase; most of the activity happens when the mean spike rate is low; and of the four mice, Lederberg has the most success in general.
(neural_activity_by_trial)
## `geom_smooth()` using formula = 'y ~ x'
(distribution_neural)
(contrast_diff)
## `geom_smooth()` using formula = 'y ~ x'
(scatter_plot_faceted)
(neural_activity_success)
(count_by_brain_area)
Here we look at success rate over time, and we see a pattern of the rate decreasing over many trials for all four mice, particularly for the mouse Forssmann. This trend of the success rate decreasing or remaining stagnant over trials appears across all 18 sessions.
(success_trials)
## `geom_smooth()` using formula = 'y ~ x'
(success_mouse)
## `geom_smooth()` using formula = 'y ~ x'
From these graphs we can see that the mice Hench and Lederberg have the highest number of distinct brain areas. The success rate graphs also support the observation that Lederberg has the most success. Additionally, we see that the most region spikes occur in the lower-numbered trials for all four mice.
(region_mean_spike_plot)
(homogeneity_heterogeneity_plot)
(success_rate_session_plot)
(mean_spike_session_plot)
(distinct_areas_plot)
(success_rate_time_plot)
### Dimension Reduction through PCA

From the PCA scree plot we can see that most of the variance in the data is captured in the first two PCs, since that is where the curve drops the most. This tells us that additional components don't add much information, and we should look primarily at these two dimensions.
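The scree-plot reasoning above can be sketched with base R's `prcomp`; the matrix below is a random stand-in for the project's actual trial-by-feature matrix.

```r
# PCA scree sketch (illustrative data; the real input is the trial-by-feature
# matrix built during data integration)
set.seed(42)
toy <- matrix(rnorm(200), nrow = 20)              # 20 "trials" x 10 "features"
pca <- prcomp(toy, center = TRUE, scale. = TRUE)  # standardize, then rotate
var_explained <- pca$sdev^2 / sum(pca$sdev^2)     # proportion of variance per PC
plot(var_explained, type = "b",
     xlab = "Principal component", ylab = "Proportion of variance explained")
```

The elbow in a plot like this is what motivates keeping only the first few components.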
For data integration, each row represents a trial, identified by trial id and session id. From there I include contrast left, contrast right, contrast difference, and the average spike rate per time bin. Using these features, this data frame will be the input upon which my predictive models are built.
## # A tibble: 6 × 45
## session_id trail_id contrast_right contrast_left contrast_diff bin1 bin2
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 0.5 0 0.5 0.0490 0.0368
## 2 1 2 0 0 0 0.0300 0.0313
## 3 1 3 1 0.5 0.5 0.0490 0.0504
## 4 1 4 0 0 0 0.0559 0.0531
## 5 1 5 0 0 0 0.0272 0.0436
## 6 1 6 0 0 0 0.0490 0.0218
## # ℹ 38 more variables: bin3 <dbl>, bin4 <dbl>, bin5 <dbl>, bin6 <dbl>,
## # bin7 <dbl>, bin8 <dbl>, bin9 <dbl>, bin10 <dbl>, bin11 <dbl>, bin12 <dbl>,
## # bin13 <dbl>, bin14 <dbl>, bin15 <dbl>, bin16 <dbl>, bin17 <dbl>,
## # bin18 <dbl>, bin19 <dbl>, bin20 <dbl>, bin21 <dbl>, bin22 <dbl>,
## # bin23 <dbl>, bin24 <dbl>, bin25 <dbl>, bin26 <dbl>, bin27 <dbl>,
## # bin28 <dbl>, bin29 <dbl>, bin30 <dbl>, bin31 <dbl>, bin32 <dbl>,
## # bin33 <dbl>, bin34 <dbl>, bin35 <dbl>, bin36 <dbl>, bin37 <dbl>, …
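The `bin1`-`bin40` columns above come from averaging each trial's spike matrix (neurons by time bins) over neurons; on a toy matrix the operation is just `colMeans`:

```r
# 3 hypothetical neurons x 4 time bins; each entry is a spike count
toy_spikes <- matrix(c(0, 1, 0, 2,
                       1, 0, 1, 0,
                       0, 0, 2, 1),
                     nrow = 3, byrow = TRUE)
bin_avg <- colMeans(toy_spikes)  # average spike count per time bin
print(bin_avg)                   # 0.333 0.333 1.000 1.000
```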
## [1] train-logloss:0.601631
## [2] train-logloss:0.550638
## [3] train-logloss:0.510053
## [4] train-logloss:0.481637
## [5] train-logloss:0.460702
## [6] train-logloss:0.441478
## [7] train-logloss:0.426525
## [8] train-logloss:0.414870
## [9] train-logloss:0.408038
## [10] train-logloss:0.392890
## Accuracy: 0.7312992
## Reference
## Prediction 0 1
## 0 76 52
## 1 221 667
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Call:
## roc.default(response = test_label, predictor = predictions)
##
## Data: predictions in 297 controls (test_label 0) < 719 cases (test_label 1).
## Area under the curve: 0.7379
The XGBoost model shows an accuracy of 0.731 and an area under the ROC curve (AUC) of 0.738.
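As a sanity check, the accuracy can be recomputed directly from the confusion matrix printed above:

```r
# confusion matrix from the XGBoost test predictions (rows = predicted,
# columns = reference), copied from the output above
conf <- matrix(c(76, 52,
                 221, 667),
               nrow = 2, byrow = TRUE,
               dimnames = list(Prediction = c("0", "1"),
                               Reference  = c("0", "1")))
accuracy <- sum(diag(conf)) / sum(conf)  # (76 + 667) / 1016
cat("Accuracy:", accuracy, "\n")         # 0.7312992
```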
## Model accuracy: 0.7214567
## F1 Score: 0.2977667
## Recall: 0.2020202
## Precision Score: 0.5660377
## Misclassification Rate: 0.2785433
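A minimal sketch of a logistic-regression fit with base R's `glm`, on a toy data frame; the variable names here are illustrative, not the project's actual features.

```r
# toy training data: success depends on x1 plus noise
set.seed(1)
train_df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
train_df$success <- as.numeric(train_df$x1 + rnorm(100) > 0)

fit   <- glm(success ~ x1 + x2, data = train_df, family = binomial)
probs <- predict(fit, type = "response")   # fitted probabilities
pred  <- as.numeric(probs > 0.5)           # 0.5 classification threshold
acc   <- mean(pred == train_df$success)    # in-sample accuracy
```

The project's logistic regression would follow the same pattern on the integrated trial features, with held-out trials producing the reported test metrics.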
Here we see that logistic regression gives us an accuracy of 0.721 and a misclassification rate of 0.279.

## Using a Random Forest Model
## Model accuracy: 0.7332677
## Model Accuracy: 0.7332677
## F1 Score: 0.1958457
## Recall: 0.1111111
## Precision: 0.825
## Misclassification Rate: 0.2667323
Finally, we see the random forest model gives us an accuracy of 0.733 and a misclassification rate of 0.267.
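The reported F1 scores are consistent with their precision and recall via F1 = 2PR / (P + R):

```r
# F1 from precision and recall, using the values reported above
f1 <- function(precision, recall) 2 * precision * recall / (precision + recall)
f1(0.5660377, 0.2020202)  # logistic regression: ~0.2978
f1(0.8250000, 0.1111111)  # random forest:       ~0.1958
```

The low F1 values despite decent accuracy likely reflect the class imbalance visible in the feedback-type counts (34,640 successes vs 14,533 failures): both models rarely predict the minority class.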
test_data <- list()
for (i in 1:2) {
  file_path <- paste('/Users/akshajjoshi/Downloads/test/test', i, '.rds', sep='')
  test_data[[i]] <- readRDS(file_path)
}
binename <- paste0("bin", as.character(1:40))

get_trail_functional_data <- function(session_id, trail_id){
  spikes <- test_data[[session_id]]$spks[[trail_id]]
  if (any(is.na(spikes))){
    message("value missing")  # disp() is not an R function; message() is
  }
  # average spike count per time bin across all neurons in this trial
  trail_bin_average <- matrix(colMeans(spikes), nrow = 1)
  colnames(trail_bin_average) <- binename
  # pull contrasts and feedback from test_data (not session, which holds the
  # training sessions)
  trail_tibble <- as_tibble(trail_bin_average) %>%
    add_column("trail_id" = trail_id) %>%
    add_column("contrast_left" = test_data[[session_id]]$contrast_left[trail_id]) %>%
    add_column("contrast_right" = test_data[[session_id]]$contrast_right[trail_id]) %>%
    add_column("feedback_type" = test_data[[session_id]]$feedback_type[trail_id])
  trail_tibble
}

get_session_functional_data <- function(session_id){
  n_trail <- length(test_data[[session_id]]$spks)
  trail_list <- list()
  for (trail_id in 1:n_trail){
    trail_list[[trail_id]] <- get_trail_functional_data(session_id, trail_id)
  }
  session_tibble <- as_tibble(do.call(rbind, trail_list))
  session_tibble <- session_tibble %>%
    add_column("mouse_name" = test_data[[session_id]]$mouse_name) %>%
    add_column("date_exp" = test_data[[session_id]]$date_exp) %>%
    add_column("session_id" = session_id)
  session_tibble
}
# build and stack the per-session tibbles for the two test sessions
session_list <- lapply(1:2, get_session_functional_data)
test_full_functional_tibble <- as_tibble(do.call(rbind, session_list))
test_full_functional_tibble$session_id <- as.factor(test_full_functional_tibble$session_id)
test_full_functional_tibble$contrast_diff <- abs(test_full_functional_tibble$contrast_left - test_full_functional_tibble$contrast_right)
test_full_functional_tibble$success <- as.numeric(test_full_functional_tibble$feedback_type == 1)
Since XGBoost has the highest model accuracy, I will be using it to model the test data.
## Warning in predicted_labels == testing_y$success: longer object length is not a
## multiple of shorter object length
## [1] "Accuracy: 0.665026569572919"
With our test data, we got an accuracy of 0.665.
Through our initial data analysis we looked at different trends in spike rate across sessions, trials, and other parameters. From there we set out to build a predictive modeling methodology.
Overall, all three of the models tested gave solid accuracy scores, and they were all very close to each other, hovering around the low 70s. With further modeling, a deep-learning neural network or a series of other models could offer more insight and a better way of predicting success rates. Another important thing to consider going forward is feature engineering, so we can decide beforehand which features are best suited for modeling. The confusion matrices also showed that these models have a lot of room for improvement, as there were still a high number of incorrect classifications.
# Acknowledgment: Used ChatGPT for understanding the assignment and code.